Modeling County-Level Voter Turnout Using the CDC’s Social Vulnerability Index (SVI)

C. Seth Lester, ASA, MAAA 2024-11-01

Background

This project was inspired by insights shared in a recent LinkedIn post where I explored whether the CDC’s Social Vulnerability Index (SVI) could accurately predict county-level voter turnout in US Presidential elections.

Public presidential polls often use likely voter models to re-weight raw polling samples, incorporating demographic factors such as race/ethnicity, poverty level, and education level, which overlap significantly with components of the SVI. In healthcare analytics, the SVI is commonly used to account and control for geographic variations in social determinants of health that can influence or confound causal relationships between healthcare interventions and consequent cost/utilization patterns for a population.

Motivated by Yubin Park’s concept of blending unrelated datasets to create “scientific curries,” I set out to investigate how social vulnerabilities might impact civic engagement - particularly through the lens of examining the relationship between social vulnerability and voter turnout at the county level of granularity.

Data Sources and Methods

The analysis relies on three primary data sources:

American Community Survey (ACS)

This data is accessed using a US Census Bureau API key and the tidycensus R package. It provides demographic and population estimates for counties, which are integral in this analysis for determining county-level estimates of the Voting Age Population (VAP) and Voting Eligible Population (VEP). The work of staging this data is done in the R script src/get_vep_totals.R.

The VAP is calculated by determining the number of individuals age 18 and over in a particular county using 5-year ACS data. VEP is calculated by subtracting the count of non-citizen individuals aged 18 and over from VAP. While these counts might slightly overestimate true VAP/VEP due to the inclusion of certain ineligible groups (e.g., felons in some states), they provide a consistent and reliable set of denominators for county-level turnout analysis.

As there is generally a 1-2 year lag between the ACS measures for a time period and the time these measures are compiled and released by the US Census Bureau, this project uses VAP/VEP totals at the county level based on the ACS data for the five year period ending in YYYY - 2 for any presidential election year in YYYY. For example, the VEP/VAP measures used to calculate turnout rates at the county level for the 2016 election are determined using 5-year ACS measures from 2010 - 2014.

Ultimately, our goal is to devise a prediction model for turnout using ACS / SVI measures that tend to have a 1-2 year lag in availability. If we want to build a turnout model to predict 2024 election turnout, we will need to use 2018 - 2022 5-year ACS measures, as that will be the latest data available for constructing SVI measures.
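The lag rule and the VEP definition above can be sketched in a few lines of base R. This is a minimal illustration rather than the contents of src/get_vep_totals.R: the helper acs_window(), the column names, and the counts are all made up for the example.

```r
# Hypothetical helper: map an election year to the 5-year ACS window used for
# its denominators (the window ends `lag` years before the election).
acs_window <- function(election_year, lag = 2) {
  end_year <- election_year - lag
  c(start = end_year - 4, end = end_year)
}

# Toy county counts (made-up numbers, illustrative column names).
counts <- data.frame(
  GEOID          = c("37001", "37003"),
  pop_18_plus    = c(130000, 29000),  # residents aged 18 and over
  noncitizen_18p = c(8000, 500)       # non-citizens aged 18 and over
)

# VAP is the 18+ population; VEP removes non-citizens aged 18 and over.
counts$VAP <- counts$pop_18_plus
counts$VEP <- counts$VAP - counts$noncitizen_18p

acs_window(2016)  # 2010 to 2014
acs_window(2024)  # 2018 to 2022
counts
```

In the real pipeline the counts would come from the 5-year ACS tables pulled through tidycensus for the window that acs_window() returns.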

MIT Election Lab

MIT Election Lab data contains historical county-level election returns, including data from 2012, 2016, and 2020. This data allows for a comprehensive analysis of turnout trends over multiple election cycles. This data is staged in the R script src/get_election_data.R.

This data is relatively straightforward, with one record per county that is joinable to the VEP/VAP data gathered from ACS and the SVI factors gathered from the CDC. One issue in this data is tabulations for county votes in Alaska. Alaska is uncommon in that the entirety of the state is not subdivided into counties - some people live outside of counties (Boroughs) - so for the sake of analyzing the relationship between SVI and turnout, I thought it best to remove Alaska vote data. We’re removing a very small piece of the sample. Sorry Alaska!

Caveat: don’t use this turnout model to predict Alaska results!

CDC Social Vulnerability Index (SVI)

The SVI is produced every 2-4 years by the CDC based on measures contained in the 5-year ACS data. This freely-downloadable dataset contains an overall SVI score for counties / census tracts in the US. The overall SVI score is further distilled from four component scores that measure social vulnerability on four categories: socioeconomic status, household composition, racial/ethnic minority status, and housing type/transportation.

Image of Captain Planet’s Planeteers summoning the various components of the SVI.

These scores (and their component pieces) are used subsequently in this analysis to quantify impact of social factors on turnout rates in Presidential elections. The SVI data required for this analysis is loaded and staged in the R script src/get_svi.R.

The process used by CDC/ATSDR to calculate SVI with the underlying ACS 5-year measures underwent a large number of changes in 2020. To address the concern that SVI (or its components) might not be sufficiently stable over time, I visually evaluated the SVI (and its four underlying component measures) over time for the 25 most populous US counties, checking that the 2020 redesign did not lead to substantial volatility in the measure (or, at least, more so than the actual year 2020 would have added to the measure).

2010 - 2022 SVI for Top 25 Most Populous US Counties

To better understand variation over time in the SVI measures, I examined the history of variation in the overall SVI measure for the top 25 counties in the US (ranked by population size).

First, we start with overall SVI, which ranks each county in the US by percentile of overall social vulnerability based on the component sum of the four social vulnerability themes:

The overall SVI for each county is determined by summing the percentile rankings of these four themes for that county, and then percentile-ranking the resulting sums across all counties. It’s important to note that no one factor is given larger “weight” than another in this calculation, which makes the computation of SVI quite simple - but also might leave something to be desired in terms of accurately measuring social vulnerability.
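The two-step calculation described above can be sketched as follows. This follows the description in the text rather than any published CDC code. The (rank - 1) / (N - 1) percentile-rank convention, the column names, and the five toy counties are assumptions made for illustration.

```r
# Percentile rank on a 0-1 scale; ties share the lowest rank (an assumption).
pct_rank <- function(x) (rank(x, ties.method = "min") - 1) / (length(x) - 1)

# Toy theme-level percentile ranks for five hypothetical counties.
themes <- data.frame(
  county = c("A", "B", "C", "D", "E"),
  theme1 = c(0.00, 0.25, 0.50, 0.75, 1.00),
  theme2 = c(0.25, 0.00, 0.75, 0.50, 1.00),
  theme3 = c(1.00, 0.75, 0.50, 0.25, 0.00),
  theme4 = c(0.00, 0.50, 0.25, 1.00, 0.75)
)

# Step 1: unweighted sum of the four theme ranks.
themes$theme_sum <- rowSums(themes[, paste0("theme", 1:4)])

# Step 2: percentile-rank the sums across counties to get the overall score.
themes$overall_svi <- pct_rank(themes$theme_sum)

themes[, c("county", "theme_sum", "overall_svi")]
```

Because every theme enters the sum with the same coefficient, a county that ranks high on one theme and low on another lands in the middle, which is the equal-weighting property noted above.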

Next, I examined the four component SVI themes over time, separately.

2010 - 2022 Socioeconomic Status Component for Top 25 Most Populous US Counties

2010 - 2022 Household Characteristics Component for Top 25 Most Populous US Counties

2010 - 2022 Racial & Ethnic Minority Status Component for Top 25 Most Populous US Counties

2010 - 2022 Housing Type & Transportation Component for Top 25 Most Populous US Counties

With the possible exception of the third component of SVI (the Racial & Ethnic Minority Status component), the four SVI components appear to be relatively stable over time when considering that in 2020 we could expect a considerable amount of variation in how SVI was measured due to both pandemic-related factors as well as underlying bias in ACS measurements sampled in 2020. Another possible explanation for the instability of the third component could be that in 2020 this component was heavily redefined.

When considering our intended task of building a turnout model with the SVI, we might consider using the underlying variables for this category instead of the component percentile rank variables, as those are more stable over time.

Initial Examination

The goal of this project is to eventually use election returns turnout data for 2012, 2016, and 2020 to develop a turnout model that uses SVI data from the prior 2 years to predict turnout at the county level.

Prior to beginning this modeling exercise I wanted to better understand the distribution of our response - presidential election turnout rates at the county level.

A first pass examining the distribution of turnout

First we want to understand the distribution of turnout using both our denominators (VAP and VEP). Note that sometimes (in 11 cases) the total number of votes received in a county will equal or exceed estimates of VAP. This is typically due to counties that are VERY small with regard to population, so the ACS estimate of population is likely to be an undercount.

The 11 non-Alaska counties where this occurs are very sparsely populated rural counties with only hundreds of residents (at most 2,416), and they were won by both Democratic and Republican candidates. There is NOT sufficient data precision in ACS population estimates to prove anything about unauthorized people voting, so put those silly tinfoil hats away, please.

First, we want to join our election data (from one source) to our ACS 5-year population estimates for VEP and VAP (from another source). We will join on FIPS code and then calculate turnout rates using both VAP and VEP as denominators. Then we will analyze the distribution of both turnout measures and remove any outliers (perhaps due to very high turnout in very low-population areas).

We define an outlier for both VAP- and VEP-based turnout rates as any county in any election year whose turnout rate is more than 3 standard deviations above the mean turnout rate. This process removes the 11 counties mentioned above with VAP- or VEP-based turnout at or above 100%, plus an additional 21 (extremely low population) counties with a turnout rate that is likely overestimated due to an undercounted denominator. With over 3,000 counties in our sample for each election year, we’re losing very little of our sample by doing this!
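The join-then-filter steps above can be sketched in base R. This is a toy version under stated assumptions: the data are simulated, the column names (totalvotes, VAP, VEP) are illustrative, and the 3-standard-deviation cutoff is computed over the pooled sample because the text does not say whether it is pooled or computed per election year.

```r
n <- 30

# Toy election returns: 29 ordinary counties plus one tiny county.
election <- data.frame(
  FIPS       = sprintf("%05d", 1000 + 1:n),
  year       = 2016,
  totalvotes = c(seq(5000, 6400, by = 50), 260)
)

# Toy ACS denominators; the tiny county's population is undercounted.
acs <- data.frame(
  FIPS = sprintf("%05d", 1000 + 1:n),
  year = 2016,
  VAP  = c(rep(10500, n - 1), 210),
  VEP  = c(rep(10000, n - 1), 200)
)

# Join on FIPS code (and year), then compute both turnout rates.
turnout <- merge(election, acs, by = c("FIPS", "year"))
turnout$turnout_vap <- turnout$totalvotes / turnout$VAP
turnout$turnout_vep <- turnout$totalvotes / turnout$VEP

# Flag rates more than 3 standard deviations above the mean.
is_outlier <- function(x) x > mean(x) + 3 * sd(x)
keep <- !(is_outlier(turnout$turnout_vap) | is_outlier(turnout$turnout_vep))
turnout_clean <- turnout[keep, ]

nrow(turnout) - nrow(turnout_clean)  # number of county-years dropped
```

In this toy data the only county removed is the tiny one whose vote total exceeds its estimated eligible population.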

Turnout over time

Since I intend to use data from 2012, 2016, and 2020 presidential elections to build the turnout model, I thought it best to next examine election turnout rates over time for the top 25 counties, as I did earlier with SVI and its four component themes.

Turnout vs. SVI in Presidential Elections

Finally, having combined turnout rates and corresponding 2-year-lagged SVI data for the three presidential election years, I plotted the relationship between overall SVI at the county level and county-level turnout rates.

The image above suggests a relatively moderate negative Pearson correlation (R) between SVI and turnout rates. I thought this was pretty darn interesting, so I posted a version of this chart on LinkedIn.

I then examined this relationship further by computing the correlation between turnout rates and each of the four primary SVI component measures, where I also found remarkable stability within each SVI component measure over the three presidential elections.

Just from a cursory scan of the Pearson correlations broken out by the four major SVI components, it looks like the real breadwinner variables for a potential turnout model will come from the Socioeconomic Status and Housing Type/Transportation categories, with perhaps some useful information in the Household Characteristics category. It seems unintuitive that racial/ethnic identity would have a meaningful relationship with turnout rates for a county, but we’ll also consider component features from this SVI category as we build our prediction model.

Let’s Build a Turnout Model!

OK, enough description of the data - let’s use SVI to predict the future!

What are we modeling?

We want a model that will predict turnout for election year YYYY based on the SVI file for YYYY - 2. Also, it is my hope that our model will predict not absolute turnout (that is, number of votes in each county), but rather, a US county’s relative turnout expressed as a multiplicative factor of the state-level turnout.

Since different states will have different degrees of turnout relative to other factors (e.g., ad spending, swing state status, etc.), the model is intended to be used as a means to predict turnout relativities between counties in a particular state. I will demonstrate usage of the model in a final section to predict turnout in my home state of North Carolina in the 2024 election.

(Also, if you’re following along at home, I designed this model like a risk adjustment model.)

In short, we will train our model on a scaled response: each county’s turnout rate divided by the average turnout rate across all counties for the same election year. Then, when our model predicts a turnout score of 1.00 +/- x, we will interpret that to mean that the county is expected to have a turnout equal to 1.00 +/- x times the average turnout across all counties for that election (represented by 1.00).
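As a minimal sketch of that scaling step, the snippet below divides each county’s turnout rate by the unweighted mean rate for its election year. The toy rates and the column name turnout_rate are assumptions; scaled_turnout is the response name that appears in the model output later in this write-up.

```r
# Toy turnout rates for three hypothetical counties in two election years.
turnout <- data.frame(
  GEOID        = rep(c("37001", "37003", "37005"), times = 2),
  year         = rep(c(2016, 2020), each = 3),
  turnout_rate = c(0.60, 0.50, 0.70, 0.66, 0.55, 0.77)
)

# ave() returns the per-year mean aligned with each row.
year_mean <- ave(turnout$turnout_rate, turnout$year, FUN = mean)
turnout$scaled_turnout <- turnout$turnout_rate / year_mean

turnout
```

By construction the scaled response averages 1.00 within each election year, so a predicted score of 1.10 reads as turnout 10% above that year’s average county.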

Should we include 2020 data?

The 2020 election presented some unique challenges (and unique motivations) that might not reflect the more stable, general relationship between SVI variables and voter turnout.

Before writing all this up, I examined the performance between two models - one trained including 2020 data, and one trained without 2020 data - and found no material difference in predictive performance.

Therefore, 2020 data is included.

Model Design and Feature Selection

Inaugural Feature Set

Ideally we would want a model that is trained on pre-2024 returns data, but predictive of turnout outcomes in 2024. The variables used to measure overall SVI across all four categories which are present across all versions of the SVI data available (2010 - 2022) are as follows:

SVI Feature   Description
EPL_POV*      Percentile percentage of persons below 100% (2010 - 2020) / 150% (2020+) poverty estimate
EPL_UNEMP     Percentile percentage of civilian (age 16+) unemployed estimate
EPL_NOHSDP    Percentile percentage of persons with no high school diploma (age 25+) estimate
EPL_AGE65     Percentile percentage of persons aged 65 and older estimate
EPL_AGE17     Percentile percentage of persons aged 17 and younger estimate
EPL_SNGPNT    Percentile percentage of single-parent households with children under 18 estimate
EPL_LIMENG    Percentile percentage of persons (age 5+) who speak English “less than well”
EPL_MINRTY    Percentile percentage of persons who identify as a racial/ethnic identity in the minority
EPL_MUNIT     Percentile percentage housing in structures with 10 or more units
EPL_MOBILE    Percentile percentage mobile homes
EPL_CROWD     Percentile percentage households with more than 1 person per room
EPL_NOVEH     Percentile percentage households with no vehicle available
EPL_GROUPQ    Percentile percentage of persons in group quarters estimate

Often a very good first step in building a predictive model is to get a handle on your feature space, including the distribution of each feature. Since the EPL_* variables are percentile ranks, we can expect each of them to follow a roughly uniform distribution on a support of 0 to 100%. Let’s confirm that now:

## Joining with `by = join_by(GEOID, year)`

We have now confirmed that every variable in the feature space reflects a Uniform(0,100) distribution. Uniformly distributed percentile ranks may not give us the most informative feature space for a high-performing predictive model, so let’s investigate the non-percentile-ranked variables (i.e., estimates of proportions) that underlie them in the loaded SVI datasets.

Each of these percentile-ranked measurements is based on a US-wide percentile ranking of an underlying proportion measure drawn directly from ACS 5-year data for the applicable time period. Each of these proportion measures is also present in SVI datasets.

SVI Feature   Description
EP_POV*       Percentage of persons below 100% (2010 - 2020) / 150% (2020+) poverty estimate
EP_UNEMP      Percentage of civilian (age 16+) unemployed estimate
EP_NOHSDP     Percentage of persons with no high school diploma (age 25+) estimate
EP_AGE65      Percentage of persons aged 65 and older estimate
EP_AGE17      Percentage of persons aged 17 and younger estimate
EP_SNGPNT     Percentage of single-parent households with children under 18 estimate
EP_LIMENG     Percentage of persons (age 5+) who speak English “less than well”
EP_MINRTY     Percentage of persons who identify as a racial/ethnic identity in the minority
EP_MUNIT      Percentage housing in structures with 10 or more units
EP_MOBILE     Percentage mobile homes
EP_CROWD      Percentage households with more than 1 person per room
EP_NOVEH      Percentage households with no vehicle available
EP_GROUPQ     Percentage of persons in group quarters estimate
## Joining with `by = join_by(GEOID, year)`

Now we will build some candidate models for our final predictive model for relative county-level turnout. These fits give us an overview of how each feature set relates to the target response (scaled_turnout).

## 
## Call:
## lm(formula = formula_ep, data = all_years)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47285 -0.06017 -0.00932  0.05096  0.62296 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.216e+00  1.852e-02  65.665  < 2e-16 ***
## EP_POV      -4.063e-03  2.664e-04 -15.249  < 2e-16 ***
## EP_UNEMP    -1.242e-03  3.977e-04  -3.123  0.00180 ** 
## EP_NOHSDP   -3.424e-03  2.673e-04 -12.808  < 2e-16 ***
## EP_AGE65     5.056e-04  3.741e-04   1.352  0.17654    
## EP_AGE17    -1.460e-03  5.338e-04  -2.734  0.00627 ** 
## EP_SNGPNT   -7.193e-03  5.330e-04 -13.493  < 2e-16 ***
## EP_LIMENG   -2.637e-03  5.795e-04  -4.550 5.43e-06 ***
## EP_MINRTY    2.204e-03  8.447e-05  26.097  < 2e-16 ***
## EP_MUNIT    -5.470e-04  2.972e-04  -1.840  0.06578 .  
## EP_MOBILE    2.681e-04  1.550e-04   1.730  0.08370 .  
## EP_CROWD    -6.706e-03  8.149e-04  -8.229  < 2e-16 ***
## EP_NOVEH    -3.057e-03  3.928e-04  -7.781 7.96e-15 ***
## EP_GROUPQ   -1.165e-02  2.971e-04 -39.218  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09746 on 9277 degrees of freedom
## Multiple R-squared:  0.3978, Adjusted R-squared:  0.3969 
## F-statistic: 471.3 on 13 and 9277 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = formula_epl, data = all_years)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46972 -0.05836 -0.00347  0.05483  0.64986 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.153374   0.007673 150.319  < 2e-16 ***
## EPL_POV     -0.074055   0.006143 -12.055  < 2e-16 ***
## EPL_UNEMP   -0.017210   0.004770  -3.608 0.000311 ***
## EPL_NOHSDP  -0.141038   0.006081 -23.193  < 2e-16 ***
## EPL_AGE65    0.037088   0.005111   7.257 4.28e-13 ***
## EPL_AGE17    0.015648   0.005177   3.022 0.002514 ** 
## EPL_SNGPNT  -0.058820   0.005571 -10.559  < 2e-16 ***
## EPL_LIMENG  -0.033557   0.004870  -6.890 5.95e-12 ***
## EPL_MINRTY   0.132302   0.005455  24.253  < 2e-16 ***
## EPL_MUNIT   -0.047356   0.005250  -9.021  < 2e-16 ***
## EPL_MOBILE  -0.006207   0.005429  -1.143 0.252902    
## EPL_CROWD   -0.009761   0.004717  -2.069 0.038554 *  
## EPL_NOVEH   -0.026778   0.005033  -5.320 1.06e-07 ***
## EPL_GROUPQ  -0.137538   0.004014 -34.263  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09689 on 9277 degrees of freedom
## Multiple R-squared:  0.4047, Adjusted R-squared:  0.4039 
## F-statistic: 485.2 on 13 and 9277 DF,  p-value: < 2.2e-16
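For readers following along, the two fits summarized above can be set up roughly as follows. The object names formula_ep, formula_epl, all_years, and scaled_turnout come from the printed calls. Everything else is an assumption: the data are simulated, only three of the 13 features are used for brevity, and the EPL_ columns are built here by percentile-ranking the simulated EP_ columns.

```r
set.seed(1)
n <- 200
ep_vars <- c("EP_UNEMP", "EP_NOHSDP", "EP_GROUPQ")  # subset of the 13 features

# Simulated stand-in for the staged county-year data.
all_years <- data.frame(
  EP_UNEMP  = runif(n, 2, 15),
  EP_NOHSDP = runif(n, 3, 35),
  EP_GROUPQ = rexp(n, rate = 0.3)
)
all_years$scaled_turnout <- 1.2 - 0.003 * all_years$EP_NOHSDP -
  0.012 * all_years$EP_GROUPQ + rnorm(n, sd = 0.05)

# Percentile-ranked (EPL_) versions of the same features.
pct_rank <- function(x) (rank(x, ties.method = "min") - 1) / (length(x) - 1)
epl_vars <- sub("^EP_", "EPL_", ep_vars)
for (i in seq_along(ep_vars)) {
  all_years[[epl_vars[i]]] <- pct_rank(all_years[[ep_vars[i]]])
}

# Build both formulas programmatically and fit ordinary least squares.
formula_ep  <- reformulate(ep_vars,  response = "scaled_turnout")
formula_epl <- reformulate(epl_vars, response = "scaled_turnout")
fit_ep  <- lm(formula_ep,  data = all_years)
fit_epl <- lm(formula_epl, data = all_years)

c(EP  = summary(fit_ep)$adj.r.squared,
  EPL = summary(fit_epl)$adj.r.squared)
```

Comparing the two adjusted R-squared values side by side is the same comparison made on the real output above.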

Out of the gate, our basic vanilla linear models explain the data quite well, with an adjusted R-squared hovering around 40% and with a feature space filled with highly-significant variables, for both the model based on EP_ (proportion estimates) variables as well as the model based on EPL_ (percentile ranked) variables in the SVI. So, that’s great news!

Given that we are getting similar predictive performance from both the EP_ and EPL_ series variables, I’ve chosen to proceed with the EP_ variables because the coefficients for our model will have a more intuitive, commonsense interpretation than with the EPL_ variables. The coefficients corresponding to EP_ variables allow us to make statements like “For every 1 percentage point increase in the estimated proportion of X, we can expect scaled turnout for that county to rise or fall by Y.” For example, the EP_GROUPQ coefficient of about -0.0117 means a 1 point increase in the group-quarters share is associated with relative turnout roughly 0.012 lower, on a scale where 1.00 is the average county. You know, for the sake of model explainability and all!

Evaluating Multicollinearity in the Feature Set

We might imagine there is substantial multicollinearity in the feature space, so we should be aware of any strong correlations between our 13 features going into the modeling project. As a matter of good model design practice, let’s take a peek at a correlation matrix for our SVI features. We’ll see nothing here that doesn’t make a lot of sense.
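A correlation check of this kind can be sketched with base R’s cor(). The data below are simulated so that EP_AGE17 and EP_AGE65 are moderately negatively correlated. The 0.4 cutoff for flagging a pair is an arbitrary choice for the example, not a value taken from the analysis.

```r
set.seed(42)
n <- 500

# Simulated proportions: a county with more seniors has fewer minors.
EP_AGE65 <- rnorm(n, mean = 18, sd = 4)
features <- data.frame(
  EP_AGE65  = EP_AGE65,
  EP_AGE17  = 35 - 0.7 * EP_AGE65 + rnorm(n, sd = 4),
  EP_GROUPQ = rexp(n, rate = 0.3)
)

# Pearson correlation matrix across all features.
cor_mat <- cor(features, method = "pearson")

# List each pair once (upper triangle) where |r| exceeds the cutoff.
idx <- which(abs(cor_mat) > 0.4 & upper.tri(cor_mat), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = round(cor_mat[idx], 2)
)

round(cor_mat, 2)
flagged
```

With all 13 SVI features the same call produces a 13 x 13 matrix, which is easier to scan as a heatmap than as printed numbers.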

Motivating Regularization

Here, we see some obvious correlation between several of these features. One obvious example to call out is the moderate negative correlation between EP_AGE17 and EP_AGE65. Our hope is that we can rely on regularization to mute (or deselect entirely) some of these features, since multicollinearity between features can inflate the variance of coefficient estimates and make our predictive model less stable.

Using R’s glmnet package, we can use regularization to arrive at a more parsimonious (i.e., potentially fewer features) model with at least similar predictive performance to the basic OLS models we fitted to SVI variables earlier. This approach will also give us the ability to consider any number of interactions between these features, where interactions with a less-than-impactful contribution to predicting the response will be “penalized” out through the regularization process.

As mentioned above, this approach should also help remove (via the regularization penalty) some of the SVI features where there is excessive multicollinearity (if any). Using glmnet, I’ve fixed the alpha parameter to 1, forcing a lasso regularization regime with an L1 (absolute value) penalty, which is known to be better at “culling” the feature space to arrive at a more parsimonious model with fewer features.
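To make the idea of considering interactions concrete, here is one way the expanded design matrix could be built in base R before handing it to a lasso fit. The three features and their values are toy data, and the use of model.matrix() with the ~ .^2 formula is an assumption about the approach rather than a description of the project’s scripts.

```r
# Toy EP_ features for four hypothetical counties.
features <- data.frame(
  EP_UNEMP  = c(4, 6, 9, 5),
  EP_NOHSDP = c(10, 14, 22, 12),
  EP_NOVEH  = c(3, 5, 11, 4)
)

# `.^2` expands to all main effects plus all pairwise interactions.
# Drop the intercept column, since a lasso fit adds its own intercept.
X <- model.matrix(~ .^2, data = features)[, -1]

colnames(X)
ncol(X)  # 3 main effects + choose(3, 2) = 3 interactions
```

With p main effects this produces p + choose(p, 2) columns, so 13 features would expand to 91 candidate terms. A matrix like X can then be passed as the x argument of glmnet() with alpha = 1, leaving the L1 penalty to shrink uninformative terms to exactly zero.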

Refining the Feature Space

Before fitting the regularized model, several refinements to the feature set are warranted:

Removing EP_POV: The poverty variable (EP_POV) changed its definition from 100% of the federal poverty level (in SVI 2010-2018) to 150% of the poverty level (in SVI 2020+). This introduces a structural break in the feature across the training data, making it unreliable as a predictor. It is excluded from the final model.

Log-transforming skewed variables: Several EP_ variables (EP_LIMENG, EP_CROWD, EP_MUNIT) exhibit strong right-skew, with many counties clustered near zero. Applying a log(1 + x) transformation improves linearity with the response.

Adding log(population): County total population (E_TOTPOP, already present in the SVI data) serves as a proxy for county scale and urbanicity. We include log(E_TOTPOP) to capture the well-known relationship between county size and turnout patterns.

Adding election year: Including the election year as a numeric feature allows the model to capture secular shifts in the relationship between SVI variables and turnout over time. Lasso regularization will zero this out if it is not informative.

Weighting by county size: Counties vary enormously in population, and ACS estimates for small counties have substantially higher standard errors. Weighting observations by log(VEP) in the loss function allows the model to focus on counties where the signal-to-noise ratio is highest.

Leave-one-year-out cross-validation: Rather than random 10-fold CV (which mixes counties from different election years in each fold), we use leave-one-year-out CV. This trains on two election cycles and tests on the third, directly measuring whether the model generalizes across elections — the actual use case.
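The refinements above can be sketched end to end on simulated data. To keep the example runnable with base R alone, a weighted lm() stands in for the lasso, so this is not the fitted model from this project. With glmnet, the same year-based folds could be supplied through the foldid argument of cv.glmnet(). All data and derived column names (log_LIMENG, log_pop, w) are made up.

```r
set.seed(7)
years <- c(2012, 2016, 2020)
n_per <- 150
N <- length(years) * n_per

# Simulated county-year data.
d <- data.frame(
  year      = rep(years, each = n_per),
  E_TOTPOP  = round(exp(rnorm(N, mean = 10, sd = 1.2))),
  EP_LIMENG = rexp(N, rate = 0.5),   # right-skewed, clustered near zero
  EP_NOHSDP = runif(N, 3, 35)
)
d$VEP <- round(0.7 * d$E_TOTPOP)
d$scaled_turnout <- 1.15 - 0.008 * d$EP_NOHSDP -
  0.03 * log1p(d$EP_LIMENG) + rnorm(N, sd = 0.08)

# Refinements: log(1 + x) transform, log population, log(VEP) weights.
d$log_LIMENG <- log1p(d$EP_LIMENG)
d$log_pop    <- log(d$E_TOTPOP)
d$w          <- log(d$VEP)

# Leave-one-year-out CV: train on two elections, predict the third.
cv_pred <- rep(NA_real_, N)
for (yr in years) {
  held_out <- d$year == yr
  fit <- lm(scaled_turnout ~ EP_NOHSDP + log_LIMENG + log_pop + year,
            data = d[!held_out, ], weights = w)
  cv_pred[held_out] <- predict(fit, newdata = d[held_out, ])
}

# Out-of-fold R-squared on the scaled response.
cv_r2 <- 1 - sum((d$scaled_turnout - cv_pred)^2) /
  sum((d$scaled_turnout - mean(d$scaled_turnout))^2)
cv_r2
```

Note that with a numeric year feature, each held-out year lies outside the range of the two training years, so the year term is always extrapolated. That is also what happens when the model is applied to 2024.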

Lasso Model Results

Using the refined feature matrix (with all pairwise interactions), the chart below shows the coefficient values for the primary variables and interactions that emerged from the glmnet process. Features whose coefficient is smaller than .0005 in absolute value, as well as the intercept, are excluded from the chart.

## [1] "Leave-one-year-out CV R-squared (on scaled turnout): 0.4973"

Testing the Model on 2024 Results

Now that we have a trained model, let’s see how it performs on completely unseen data: the 2024 presidential election. The model was trained exclusively on 2012, 2016, and 2020 data, so 2024 represents a true out-of-sample validation.

North Carolina: A Closer Look

First, let’s zoom in on my home state of North Carolina, where the model performs quite well on the 2024 data.

LOCATION Predicted_Votes Actual_Votes Absolute_Error_Pct
Alamance County, North Carolina 94521 89831 5.22%
Alexander County, North Carolina 21140 20677 2.24%
Alleghany County, North Carolina 6597 6496 1.55%
Anson County, North Carolina 11691 10875 7.50%
Ashe County, North Carolina 16098 16253 0.95%
Avery County, North Carolina 9141 9489 3.67%
Beaufort County, North Carolina 25899 26572 2.53%
Bertie County, North Carolina 10132 9186 10.30%
Bladen County, North Carolina 17273 16764 3.04%
Brunswick County, North Carolina 97205 109378 11.13%
Buncombe County, North Carolina 166496 160510 3.73%
Burke County, North Carolina 49063 45847 7.01%
Cabarrus County, North Carolina 127578 120202 6.14%
Caldwell County, North Carolina 43357 43540 0.42%
Camden County, North Carolina 6225 6304 1.25%
Carteret County, North Carolina 43071 45817 5.99%
Caswell County, North Carolina 13078 12040 8.62%
Catawba County, North Carolina 88949 87109 2.11%
Chatham County, North Carolina 47385 52301 9.40%
Cherokee County, North Carolina 18831 17824 5.65%
Chowan County, North Carolina 7877 7552 4.30%
Clay County, North Carolina 6921 7728 10.44%
Cleveland County, North Carolina 54445 51706 5.30%
Columbus County, North Carolina 27406 26402 3.80%
Craven County, North Carolina 56187 56173 0.02%
Cumberland County, North Carolina 170093 140513 21.05%
Currituck County, North Carolina 18231 18053 0.99%
Dare County, North Carolina 25162 25196 0.13%
Davidson County, North Carolina 96294 93452 3.04%
Davie County, North Carolina 25517 26850 4.96%
Duplin County, North Carolina 23038 22898 0.61%
Durham County, North Carolina 176527 180912 2.42%
Edgecombe County, North Carolina 26723 24448 9.31%
Forsyth County, North Carolina 205895 204726 0.57%
Franklin County, North Carolina 39148 42667 8.25%
Gaston County, North Carolina 129732 119256 8.78%
Gates County, North Carolina 6565 5868 11.88%
Graham County, North Carolina 4478 4779 6.30%
Granville County, North Carolina 33229 32104 3.50%
Greene County, North Carolina 8815 8450 4.32%
Guilford County, North Carolina 291496 285053 2.26%
Halifax County, North Carolina 26486 23965 10.52%
Harnett County, North Carolina 73845 63757 15.82%
Haywood County, North Carolina 38668 37851 2.16%
Henderson County, North Carolina 73629 69974 5.22%
Hertford County, North Carolina 11658 9843 18.44%
Hoke County, North Carolina 27570 22767 21.10%
Hyde County, North Carolina 2544 2421 5.08%
Iredell County, North Carolina 110927 110875 0.05%
Jackson County, North Carolina 23781 21942 8.38%
Johnston County, North Carolina 123333 124678 1.08%
Jones County, North Carolina 5590 5463 2.32%
Lee County, North Carolina 32258 30081 7.24%
Lenoir County, North Carolina 27960 27503 1.66%
Lincoln County, North Carolina 53579 55582 3.60%
McDowell County, North Carolina 24213 23655 2.36%
Macon County, North Carolina 22763 21934 3.78%
Madison County, North Carolina 12846 13621 5.69%
Martin County, North Carolina 12445 12040 3.36%
Mecklenburg County, North Carolina 631990 577505 9.43%
Mitchell County, North Carolina 8485 8842 4.04%
Montgomery County, North Carolina 13924 13206 5.44%
Moore County, North Carolina 63371 61790 2.56%
Nash County, North Carolina 51757 52471 1.36%
New Hanover County, North Carolina 142128 138734 2.45%
Northampton County, North Carolina 10690 9215 16.01%
Onslow County, North Carolina 99666 81681 22.02%
Orange County, North Carolina 80800 87807 7.98%
Pamlico County, North Carolina 7927 7976 0.61%
Pasquotank County, North Carolina 22594 20343 11.07%
Pender County, North Carolina 35898 38909 7.74%
Perquimans County, North Carolina 8082 7666 5.43%
Person County, North Carolina 23667 22036 7.40%
Pitt County, North Carolina 90605 87130 3.99%
Polk County, North Carolina 13153 13068 0.65%
Randolph County, North Carolina 78172 76008 2.85%
Richmond County, North Carolina 21218 19873 6.77%
Robeson County, North Carolina 55314 46770 18.27%
Rockingham County, North Carolina 49021 49595 1.16%
Rowan County, North Carolina 78760 75394 4.46%
Rutherford County, North Carolina 36299 34670 4.70%
Sampson County, North Carolina 28371 28201 0.60%
Scotland County, North Carolina 16249 14626 11.10%
Stanly County, North Carolina 35295 36714 3.87%
Stokes County, North Carolina 27084 27175 0.33%
Surry County, North Carolina 37355 37508 0.41%
Swain County, North Carolina 7621 7052 8.07%
Transylvania County, North Carolina 21513 20780 3.53%
Tyrrell County, North Carolina 1757 1757 0.00%
Union County, North Carolina 141276 139355 1.38%
Vance County, North Carolina 22420 20092 11.59%
Wake County, North Carolina 680297 653580 4.09%
Warren County, North Carolina 10849 10013 8.35%
Washington County, North Carolina 6365 5944 7.08%
Watauga County, North Carolina 30977 33095 6.40%
Wayne County, North Carolina 59554 54762 8.75%
Wilkes County, North Carolina 34834 36320 4.09%
Wilson County, North Carolina 40865 40045 2.05%
Yadkin County, North Carolina 21034 20397 3.12%
Yancey County, North Carolina 11080 11283 1.80%
## # A tibble: 1 × 3
##   Predicted_Votes Actual_Votes MAPE 
##             <int>        <dbl> <chr>
## 1         5909921      5699141 5.62%

## [1] "NC R-squared (scaled turnout): 0.6493"

National 2024 Validation

Having demonstrated the model’s performance on North Carolina, let’s now evaluate how the model performs across all US states (excluding Alaska) in the 2024 presidential election.

## [1] "National R-squared (scaled turnout): 0.4679"

## [1] "National MAE: 3438 votes"

## [1] "National MAPE: 7.26%"

## [1] "Counties evaluated: 3072"
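The error metrics reported here can be reproduced with a few small helpers. The sketch below uses the first three rows of the North Carolina table above as input. The helper names are hypothetical, and the formulas are the standard definitions, which match the Absolute_Error_Pct column for those rows.

```r
# Mean absolute error, in votes.
mae <- function(predicted, actual) mean(abs(predicted - actual))

# Mean absolute percentage error, relative to actual votes.
mape <- function(predicted, actual) 100 * mean(abs(predicted - actual) / actual)

# Coefficient of determination. In this write-up it is reported on
# scaled turnout, not on raw vote counts.
r_squared <- function(predicted, actual) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# First three rows of the North Carolina table.
nc <- data.frame(
  LOCATION        = c("Alamance", "Alexander", "Alleghany"),
  Predicted_Votes = c(94521, 21140, 6597),
  Actual_Votes    = c(89831, 20677, 6496)
)
nc$Absolute_Error_Pct <-
  100 * abs(nc$Predicted_Votes - nc$Actual_Votes) / nc$Actual_Votes

round(nc$Absolute_Error_Pct, 2)  # 5.22 2.24 1.55, matching the table
mae(nc$Predicted_Votes, nc$Actual_Votes)
mape(nc$Predicted_Votes, nc$Actual_Votes)
```

MAE is dominated by large counties because it is measured in votes, while MAPE treats every county equally. That is why the two can rank states quite differently in the breakdown that follows.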

State-Level Breakdown

State N_Counties Total_Actual Total_Predicted Total_Error_Pct MAE MAPE R_Squared
CA 58 15862678 16199887 2.13% 12047 6.60% 0.5460
TX 250 11387360 11596840 1.84% 3376 10.87% 0.4900
FL 67 10893752 11720480 7.59% 15003 7.40% 0.5527
NY 62 8262495 7763967 6.03% 9881 8.69% 0.6933
PA 67 7034206 7182482 2.11% 3486 4.78% 0.7018
OH 88 5767788 5945850 3.09% 3486 4.06% 0.8056
NC 100 5699141 5909968 3.70% 3167 5.62% 0.6493
MI 83 5664186 5959316 5.21% 4105 4.58% 0.6153
IL 102 5633310 5922508 5.13% 3815 4.76% 0.6521
GA 158 5248353 5458558 4.01% 2447 9.06% 0.5844
VA 133 4482576 4755598 6.09% 2803 6.83% 0.7335
NJ 21 4272725 4560349 6.73% 14003 5.92% 0.8429
WA 39 3924243 4291091 9.35% 10217 6.59% 0.5989
MA 14 3473668 3666895 5.56% 15287 5.55% 0.6886
WI 72 3422918 3650889 6.66% 3309 7.48% 0.7821
AZ 15 3412953 3646607 6.85% 18729 8.67% 0.1798
MN 87 3253920 3488831 7.22% 2882 6.96% 0.4305
CO 61 3169115 3479863 9.81% 5330 9.07% 0.5793
TN 95 3063942 3168839 3.42% 2681 5.90% 0.5188
MD 24 3038334 3231753 6.37% 8770 4.97% 0.7843
IN 92 2936677 2999841 2.15% 1987 4.48% 0.7215
MO 115 2871039 2978304 3.74% 2513 7.89% 0.4513
SC 46 2548140 2597442 1.93% 2535 6.04% 0.4997
AL 67 2264972 2285343 0.90% 1962 6.76% 0.3360
OR 36 2244493 2371007 5.64% 4279 5.28% 0.4553
KY 120 2074530 2051083 1.13% 924 6.53% 0.6011
LA 64 2006975 1952339 2.72% 2351 10.60% 0.2879
IA 99 1674011 1740455 3.97% 1054 6.17% 0.4188
OK 77 1566173 1592185 1.66% 1213 6.38% 0.5824
UT 28 1487944 1702580 14.43% 8036 7.68% 0.1830
NV 17 1484840 1563282 5.28% 7052 9.09% 0.5588
KS 105 1327591 1398320 5.33% 896 7.68% 0.5295
MS 82 1228008 1194542 2.73% 1126 9.52% 0.3399
AR 72 1165888 1161816 0.35% 1214 8.33% 0.5332
NE 93 947159 1007597 6.38% 762 6.95% 0.4999
ID 44 905057 969188 7.09% 1820 9.58% 0.3677
NH 10 826189 889399 7.65% 6615 5.44% 0.5076
ME 16 824806 863134 4.65% 2531 4.34% 0.6715
NM 16 809679 840895 3.86% 2959 8.68% 0.6120
WV 55 762390 726293 4.73% 932 7.71% 0.5818
MT 55 602163 635947 5.61% 908 9.05% 0.5143
HI 4 516701 534780 3.50% 9135 6.30% 0.5872
RI 5 511816 510252 0.31% 5562 6.35% 0.3655
DE 3 511697 545006 6.51% 15127 7.39% 0.4884
SD 65 425860 433859 1.88% 333 7.65% 0.6312
VT 14 372885 381514 2.31% 838 3.68% 0.6766
ND 52 367508 370178 0.73% 425 6.98% 0.4408
DC 1 325879 288559 11.45% 37320 11.45% NA
WY 23 269048 289065 7.44% 1191 6.72% 0.5228

State-Level Prediction Performance (2024)
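A state-level summary like the one above can be assembled by splitting county-level predictions by state. This sketch assumes the columns are defined the way their names suggest. The two states and their county values are made up, and the final line checks the Total_Error_Pct formula against the NC row of the table.

```r
# Toy county-level predictions for two hypothetical states.
county <- data.frame(
  State     = c("AA", "AA", "AA", "BB", "BB"),
  Predicted = c(1050, 480, 2100, 900, 310),
  Actual    = c(1000, 500, 2000, 1000, 300)
)

# One summary row per state.
by_state <- do.call(rbind, lapply(split(county, county$State), function(s) {
  abs_err <- abs(s$Predicted - s$Actual)
  data.frame(
    State           = s$State[1],
    N_Counties      = nrow(s),
    Total_Actual    = sum(s$Actual),
    Total_Predicted = sum(s$Predicted),
    Total_Error_Pct = 100 * abs(sum(s$Predicted) - sum(s$Actual)) / sum(s$Actual),
    MAE             = mean(abs_err),
    MAPE            = 100 * mean(abs_err / s$Actual)
  )
}))
by_state

# Check against the NC row: totals of 5909968 predicted vs 5699141 actual.
round(100 * abs(5909968 - 5699141) / 5699141, 2)  # 3.7
```

Total_Error_Pct lets over- and under-predictions cancel across counties, so it can be much smaller than MAPE for the same state.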